fix(build): probe CUDA toolkit layouts in a shared openinfer-build crate by FeathBow · Pull Request #343 · openinfer-project/openinfer

FeathBow · 2026-06-10T18:36:18Z

Description

Fixes #342

CUDA toolkit discovery was duplicated across build scripts, each assuming the classic /usr/local/cuda layout (lib64/ for libs, include/ for headers). Two real layouts break that assumption: conda/micromamba (libs in lib/, headers in targets/<arch>-linux/include/) and the NVIDIA HPC SDK (cuBLAS in a math_libs/<ver> sibling tree). This PR concentrates discovery in a shared openinfer-build crate: find_package probes several check files per root, and cuda_libs probes lib64/, lib/, and targets/<arch>/lib plus the math_libs sibling, emitting only directories that exist. The evidenced sites (openinfer-kernels, cuda-sys, cudart-sys) migrate to it; gdrapi-sys and libibverbs-sys keep their current behavior but route through the same helper.
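For readers who want a concrete picture of the probing described above, here is a minimal sketch. It is not the code in this PR: the function names `math_libs_sibling`, `existing_lib_dirs`, and `link_search_lines` are hypothetical, the `targets/<arch>-linux/lib` spelling is assumed from the header path mentioned above, and the `math_libs/<ver>/lib64` location assumes the toolkit root is a versioned directory two levels below the SDK directory.

```rust
use std::path::{Path, PathBuf};

/// Assumed HPC SDK shape: `<sdk>/cuda/<ver>` next to `<sdk>/math_libs/<ver>/lib64`.
/// Returns the candidate math_libs directory for a toolkit root, if the root
/// has enough parent components to derive one.
fn math_libs_sibling(root: &Path) -> Option<PathBuf> {
    let version = root.file_name()?;
    let sdk = root.parent()?.parent()?;
    Some(sdk.join("math_libs").join(version).join("lib64"))
}

/// Probe the classic, conda-style, and targets/ library directories plus the
/// math_libs sibling, and keep only the ones that exist, in probe order.
fn existing_lib_dirs(root: &Path, arch: &str) -> Vec<PathBuf> {
    let mut candidates = vec![
        root.join("lib64"),
        root.join("lib"),
        root.join("targets").join(format!("{}-linux", arch)).join("lib"),
    ];
    candidates.extend(math_libs_sibling(root));
    candidates.into_iter().filter(|d| d.is_dir()).collect()
}

/// Format the directives a build script would print so Cargo adds each
/// directory to the native link search path.
fn link_search_lines(dirs: &[PathBuf]) -> Vec<String> {
    dirs.iter()
        .map(|d| format!("cargo:rustc-link-search=native={}", d.display()))
        .collect()
}
```

A build script would print each string from `link_search_lines` with `println!`. Emitting only directories that exist keeps a missing `lib64/` out of the link line on a conda toolkit.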

Before

Dual-GH200 (aarch64, sm_90), NVIDIA HPC SDK toolkit: linking openinfer-kernels fails; the only workaround was a manual LIBRARY_PATH export.
Single GPU (x86_64, sm_89), conda toolkit: the cuda-sys/cudart-sys build scripts panic on the header probe, taking cargo test --workspace down with them.

Error logs

# Dual-GH200 (aarch64, sm_90), HPC SDK toolkit — openinfer-kernels link stage
ld: cannot find -lcublas: No such file or directory
ld: cannot find -lcublasLt: No such file or directory

# Single GPU (x86_64, sm_89), conda toolkit — openinfer-comm-cuda-sys build script
cuda-sys build error: required header `include/cuda.h` not found.
Looked at `$CUDA_HOME` (set to ".../envs/<conda-env>") and default paths ["/usr/local/cuda"]

After

Dual-GH200 (aarch64, sm_90), NVIDIA HPC SDK toolkit: openinfer-kernels relinks with LIBRARY_PATH unset; the workaround is deleted.
Single GPU (x86_64, sm_89), conda toolkit: cuda-sys/cudart-sys build, openinfer-kernels relinks, and the Qwen3-4B golden gate runs green on the fixed tree.
Layout unit tests pass on both machines and on a CUDA-less host (the crate has no CUDA dependency).

Verification logs

== buildfix verify, Dual-GH200 (aarch64, sm_90), HPC SDK toolkit ==
LIBRARY_PATH=unset
=A= openinfer-build unit tests
test result: ok. 5 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.01s
=B= kernels relink (gate --no-run)
    Finished `release` profile [optimized] target(s) in 45.60s

== buildfix verify, Single GPU (x86_64, sm_89), conda toolkit ==
=A= openinfer-build unit tests
test result: ok. 5 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s
=B= cuda-sys
    Finished `release` profile [optimized] target(s) in 1.89s
=C= cudart-sys
    Finished `release` profile [optimized] target(s) in 1.23s
=D= kernels relink (gate --no-run)
    Finished `release` profile [optimized] target(s) in 38.85s
=E= golden gate full run
test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 38.05s

== buildfix verify, CUDA-less dev host (macOS arm64) ==
test result: ok. 5 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

Type of Change

Bug fix (non-breaking change which fixes an issue)

Checklist

My code follows the style guidelines of this project (see docs/conventions/coding-style.md).
I have performed a self-review of my own code.
I have formatted my commits according to Commitizen conventions.
I have run the local test suite and all tests pass (see CLAUDE.md).

xiaguan · 2026-06-11T07:49:38Z

Review notes from a build-discovery pass:

Thanks for tackling this. The motivation makes sense: CUDA discovery is currently duplicated across build scripts, and the classic /usr/local/cuda/include + /usr/local/cuda/lib64 assumption breaks real conda/micromamba and NVIDIA HPC SDK installs.

The new openinfer-build crate is a good direction. Centralizing find_package, CUDA header candidates, and CUDA library candidates makes the behavior easier to test and reason about. The unit tests for classic, conda-style, and HPC SDK layouts are also useful.

I do not think this fully closes #342 yet, though. The issue lists several CUDA discovery sites, but this PR migrates only some of them. A few active build paths still keep their own CUDA assumptions:

openinfer-comm/crates/openinfer-comm-a2a-kernels/build.rs:31 still hardcodes /usr/local/cuda/lib64.
openinfer-comm/crates/openinfer-comm-torch-lib/build.rs:29 still hardcodes /usr/local/cuda/include.
openinfer-cupti/build.rs:3 still has separate CUDA root/include/lib logic.
kvbm/kvbm-kernels/build.rs:369 still has its own CUDA library path handling.

There is also one partial fix inside openinfer-kernels: the link search path now uses openinfer_build::link_cuda, but the compile/include side still assumes $CUDA_HOME/include at openinfer-kernels/build.rs:1139 and openinfer-kernels/build.rs:1102. That means a conda layout with headers under targets/<arch>-linux/include can still fail for generated Triton wrapper compilation.

My suggestion is to make openinfer-build expose one higher-level CUDA discovery result, for example root, include dirs, lib dirs, stub lib dirs, and nvcc path, then migrate all CUDA build scripts to consume that instead of each caller composing pieces manually.

I verified:

cargo test --release -p openinfer-build --lib
cargo fmt --all --check
cargo metadata --no-deps --format-version 1

I could not run the CUDA build locally because nvcc is not on PATH on this host.

FeathBow · 2026-06-11T12:59:49Z

Thanks for the careful review (and verification)! Agreed on these points, and the include-side catch is a real gap. I'll push a patch to this PR later :)

FeathBow · 2026-06-11T15:57:42Z

Pushed the follow-up covering these points.

Discovery now lives in one openinfer_build::CudaToolkit: root, nvcc, existing include dirs, and existing lib dirs in link order, plus header_dir(h) for host-compiler -I flags and link_search() / link_search_stubs(). All seven sites consume it: openinfer-kernels, cuda-sys, cudart-sys, a2a-kernels, torch-lib, cupti, and kvbm-kernels.
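As an illustration of the shape described above, the sketch below shows how a single discovery result can answer "which include directory holds this header". It is only a sketch under assumptions: `ToolkitSketch`, `probe`, and `include_flag` are hypothetical names, not the crate's actual API, and only the two include layouts named in this thread are probed.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical stand-in for a discovery result; not the crate's real type.
struct ToolkitSketch {
    root: PathBuf,
    /// Include directories that exist on disk, in probe order.
    include_dirs: Vec<PathBuf>,
}

impl ToolkitSketch {
    /// Probe the classic `include/` and the conda-style
    /// `targets/<arch>-linux/include/` under `root`, keeping existing dirs.
    fn probe(root: &Path, arch: &str) -> ToolkitSketch {
        let candidates = [
            root.join("include"),
            root.join("targets").join(format!("{}-linux", arch)).join("include"),
        ];
        let include_dirs = candidates.iter().filter(|d| d.is_dir()).cloned().collect();
        ToolkitSketch { root: root.to_path_buf(), include_dirs }
    }

    /// First include directory that actually contains `header`, if any.
    fn header_dir(&self, header: &str) -> Option<&Path> {
        self.include_dirs
            .iter()
            .map(PathBuf::as_path)
            .find(|d| d.join(header).is_file())
    }

    /// Host-compiler flag for the directory that holds `header`.
    fn include_flag(&self, header: &str) -> Option<String> {
        self.header_dir(header).map(|d| format!("-I{}", d.display()))
    }
}
```

Looking up the directory per header, rather than assuming `$CUDA_HOME/include`, is what lets a host-compiler invocation find `cuda.h` on a conda layout.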

One behavior change to flag: kvbm-kernels used to check CUDA_PATH before CUDA_HOME, the opposite of the rest of the workspace. CudaToolkit resolves CUDA_HOME first, which only matters when both variables are set to different roots. Happy to special-case it if the old precedence was intentional. Separately, cupti hardcoded targets/x86_64-linux, which was silently broken on aarch64; this PR fixes that in passing.
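To make the two points above concrete, here is a small sketch of the precedence rule and of deriving the targets/ directory from the target architecture. The helper names `non_empty`, `resolve_root`, and `targets_dir` are hypothetical, the environment values are passed in as parameters so the logic is testable, and fallback to default roots such as /usr/local/cuda is left out.

```rust
use std::path::{Path, PathBuf};

/// Treat an unset or empty variable as absent.
fn non_empty(value: Option<&str>) -> Option<PathBuf> {
    value.filter(|s| !s.is_empty()).map(PathBuf::from)
}

/// CUDA_HOME wins over CUDA_PATH; the order only matters when both are set
/// to different roots.
fn resolve_root(cuda_home: Option<&str>, cuda_path: Option<&str>) -> Option<PathBuf> {
    non_empty(cuda_home).or_else(|| non_empty(cuda_path))
}

/// Build `targets/<arch>-linux` from the target architecture instead of
/// hardcoding x86_64. In a build script, `arch` would typically come from
/// the `CARGO_CFG_TARGET_ARCH` environment variable that Cargo sets.
fn targets_dir(root: &Path, arch: &str) -> PathBuf {
    root.join("targets").join(format!("{}-linux", arch))
}
```

With both variables set, `resolve_root(Some("/opt/a"), Some("/opt/b"))` yields `/opt/a`; swapping the two `non_empty` calls would reproduce the old kvbm-kernels order.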

Before (conda toolkit, the include-side gap you predicted, reproduced with the CPATH workaround removed):

Error logs — Single GPU (x86_64, sm_89), conda toolkit

=B1= kernels triton wrapper host-cc compile
triton_flash_attention_prefill_hd256.cd26a890_.c:6:10: fatal error: cuda.h: No such file or directory
=B2= torch-lib hw-cuda
.../torch/include/ATen/cuda/CUDAContextLight.h:9:10: fatal error: cuda_runtime_api.h: No such file or directory
=B3= comm bins real link (blocked behind the kernels failure)
error: failed to run custom build command for `openinfer-kernels v0.1.0 (...)`

After — with the CPATH/LIBRARY_PATH workarounds deleted, kernels, torch-lib (hw-cuda), the openinfer-comm binaries (first full link of pplx_a2a_bench, with gdrapi/ibverbs/cudart all resolved), cupti, and kvbm-kernels all build on the conda machine and the golden gate runs green; layout unit tests pass on all three hosts:

Verification logs

== Single GPU (x86_64, sm_89), conda toolkit, CPATH workaround deleted ==
=A1= openinfer-build unit tests
test result: ok. 6 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s
=A2= kernels (triton wrapper via header_dir)
    Finished `release` profile [optimized] target(s) in 35.88s
=A3= torch-lib hw-cuda
    Finished `release` profile [optimized] target(s) in 9.40s
=A4= comm bins real link
    Finished `release` profile [optimized] target(s) in 56.35s
=A5= cupti
    Finished `release` profile [optimized] target(s) in 0.11s
=A6= kvbm-kernels
    Finished `release` profile [optimized] target(s) in 9.14s
=A7= golden gate full run
test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 37.92s

== Dual-GH200 (aarch64, sm_90), HPC SDK toolkit, LIBRARY_PATH unset ==
=G1= openinfer-build unit tests
test result: ok. 6 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.04s
=G2= kernels relink (gate --no-run)
    Finished `release` profile [optimized] target(s) in 2m 15s

== CUDA-less dev host ==
test result: ok. 6 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.01s


FeathBow force-pushed the fix/build-cuda-discovery branch from a71b566 to 007ddcb · June 11, 2026 15:37

FeathBow force-pushed the fix/build-cuda-discovery branch from f0a6138 to affc3bd · June 12, 2026 10:09

FeathBow added 3 commits June 12, 2026 11:12

fix(build): probe CUDA toolkit layouts in a shared openinfer-build crate

3a7fb8c

fix(build): migrate every CUDA discovery site to a shared CudaToolkit

4fa715e

fix(build): diagnose missing CUDA root and cuda.h early, test find_package hit

b0191a8

FeathBow force-pushed the fix/build-cuda-discovery branch from affc3bd to b0191a8 · June 12, 2026 10:21

FeathBow wants to merge 3 commits into openinfer-project:main from FeathBow:fix/build-cuda-discovery
